Text Line Extraction from Complex Layout Documents
Authors
Abstract
Many stylized documents do not follow traditional text layouts: their printed text regions are not parallel to each other. Such complex layouts make text line extraction challenging because paragraphs appear in multiple orientations. This paper introduces a system for text line extraction from complex layout documents. The proposed method is based on dilation and histogram profiling. Text regions are extracted using a dilation and flood fill based approach, then the orientation of each paragraph is determined and individual text lines are extracted. The accuracy of the extracted text lines is evaluated using a newly proposed concept that is also based on histogram profiling. The results of the proposed approach on complex layouts are promising.
General Terms: Document Analysis and Recognition, Optical Character Recognition.
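The pipeline named in the abstract (dilation, flood fill, then histogram profiling) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the square structuring element, the 4-connected flood fill, and the zero-ink row threshold are all assumptions made for illustration, and the orientation step is omitted, so every paragraph is assumed to be already horizontal.

```python
from collections import deque

def dilate(img, radius=1):
    """Binary dilation with a square (2*radius+1) structuring element (assumed shape)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

def flood_fill_regions(mask):
    """Flood fill each 4-connected blob; return inclusive boxes (top, left, bottom, right)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

def split_lines(img, box):
    """Histogram profiling: count ink per row inside the box; empty rows separate lines."""
    top, left, bottom, right = box
    profile = [sum(img[y][left:right + 1]) for y in range(top, bottom + 1)]
    lines, start = [], None
    for i, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = i                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((top + start, top + i - 1))
            start = None                   # the text line ended on the previous row
    if start is not None:
        lines.append((top + start, bottom))
    return lines

def extract_text_lines(img, radius=1):
    """Dilate to merge lines into paragraph blobs, then profile each blob on the original image."""
    regions = flood_fill_regions(dilate(img, radius))
    return [(box, split_lines(img, box)) for box in regions]

# Tiny synthetic page (1 = ink): two paragraphs of two text lines each.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]

for box, lines in extract_text_lines(page):
    print(box, lines)
```

Dilation merges neighbouring text lines into one blob per paragraph, so the flood fill yields paragraph-level boxes; the row histogram is then taken on the original, undilated pixels so that the gaps between lines are still visible. A multi-orientation layout would need each region rotated to horizontal before profiling, which this sketch does not attempt.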
Similar resources
Information Extraction from Document Images using Attention Based Layout Segmentation
Introduction: The attention of a human reader and the reading speed strongly depend on the layout of a document. The term layout is used for the geometrical arrangement of document components (i.e. text, graphics and figures) on the page as well as for the typographic features of the text (i.e. font type, style, size, alignment and line spacing). Although the human visual and cognitive perceptio...
Automatic Detection of Font Size Straight from Run Length Compressed Text Documents
Automatic detection of font size finds many applications in the area of intelligent OCRing and document image analysis, which has been traditionally practised over uncompressed documents, although in real life the documents exist in compressed form for efficient storage and transmission. It would be novel and intelligent if the task of font size detection could be carried out directly from the ...
Detection, Extraction and Representation of Tables
We are concerned with the extraction of tables from exchange format representations of very diverse composite documents. We put forward a flexible representation scheme for complex tables, based on a clear distinction between the physical layout of a table and its logical structure. Relying on this scheme, we develop a new method for the detection and the extraction of tables by an analysis of ...
A New Text Mining Method for Extracting User Context Information to Improve Search Engine Result Ranking
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read so many web documents and retrieve the most similar ones to the user ...
Natural Language Inspired Approach for Handwritten Text Line Detection in Legacy Documents
Document layout analysis is an important task needed for handwritten text recognition among other applications. Text layout commonly found in handwritten legacy documents is in the form of one or more paragraphs composed of parallel text lines. An approach for handwritten text line detection is presented which uses machine learning techniques and methods widely used in natural language processin...